Decoding Data in a CGI

In the last lesson you got a better look at the garbage that is returned by WWW clients from a form. While you could see the data there, it was interspersed with odd characters; lots of "+" and "%" and such. In this lesson we will learn how to extract the information you want from this apparent mess.

Required OSAX

ScriptTools
Tokenize
DecodeURL
DePlus

NOTE: if you have not yet installed these OSAXes, then do it before starting this lesson. The script will not compile without them. Go back to the Requirements section to download the OSAXes if you need them.

Script5.txt - Decoding Data

Here is the entire script for this lesson. The comments have been removed so you see only the lines that actually get compiled. The full script, including comments and special characters, is in the archive with the name "Script5.txt".

property crlf : (ASCII character 13) & (ASCII character 10)
property http_10_header : "HTTP/1.0 200 OK" & crlf & "Server: WebSTAR/1.0 ID/ACGI" & crlf & ¬
   "MIME-Version: 1.0" & crlf & "Content-type: text/html" & crlf & crlf
property idletime : 1800
property datestamp : 0

set datestamp to current date

on «event WWWΩsdoc» path_args ¬
   given «class kfor»:http_search_args, «class post»:post_args, «class meth»:method, ¬
      «class addr»:client_address, «class user»:username, «class pass»:password, «class frmu»:from_user, ¬
      «class svnm»:server_name, «class svpt»:server_port, «class scnm»:script_name, ¬
      «class ctyp»:content_type, «class refr»:referer, «class Agnt»:user_agent, ¬
      «class Kact»:action, «class Kapt»:action_path, «class Kcip»:client_ip, «class Kfrq»:full_request

 try

   set datestamp to current date

   set return_page to http_10_header ¬
      & "<HTML><HEAD><TITLE>Parsed Results</TITLE></HEAD>" ¬
      & "<BODY><H1>Parsed Results</H1>" & return
   set return_page to return_page & "<H4>post_args</H4><PRE>" & return

   set postarglist to tokenize (dePlus post_args) with delimiters {"&"}

   set postargtext to ""
   repeat with curritem in postarglist
      set postargtext to postargtext & ¬
         (Decode URL curritem) & return & return
   end repeat

   set return_page to return_page & postargtext & "</PRE>" & return
   set return_page to return_page ¬
      & "<HR><I>Results generated at: " & (current date) ¬
      & "</I>" & "</BODY></HTML>"
   return return_page

 on error errMsg number errNum
   set return_page to http_10_header ¬
      & "<HTML><HEAD><TITLE>Error Page</TITLE></HEAD>" ¬
      & "<BODY><H1>Error Encountered!</H1>" & return ¬
      & "An error was encountered while trying to run this script." & return
   set return_page to return_page ¬
      & "<H3>Error Message</H3>" & return & errMsg & return ¬
      & "<H3>Error Number</H3>" & return & errNum & return ¬
      & "<H3>Date</H3>" & return & (current date) & return
   set return_page to return_page ¬
      & "<HR>Please notify Mr. Webmaster at " ¬
      & "<A HREF=\"mailto:webmaster@this.site.com\">webmaster@this.site.com</A>" ¬
      & " of this error." & "</BODY></HTML>"
   return return_page
 end try
end «event WWWΩsdoc»

on idle
   if (current date) > (datestamp + idletime) then
      quit
   end if
   return 5
end idle

on quit
   continue quit
end quit

Step By Step

There is really only one addition to this script. Instead of just separating the list into its separate items, we now do some decoding on each item. The decoding does two things:

Converts all codes which look like "%XX" to a character, where XX is the hexadecimal code for that character. This means "%20" becomes a space and "%26" becomes an ampersand.
Converts all occurrences of "+" to spaces. This is only necessary for the NCSA Mosaic and Netscape clients. In case you have a major vision problem and missed my previous comments on this subject, these two clients use the "+" character to encode spaces in text before passing it on to WebSTAR. Can you say "no-no"?

We will use two new OSAXes to do the decoding. The first, DecodeURL, was written by Chuck Shotton (yes, that Chuck Shotton). The second, DePlus, was written by myself using tons of Chuck's original code. Both are free products and major time- and code-savers.

First, we use DePlus on the whole block of the post_args text. This is possible because all of the real "+" characters were encoded by the client. We'll decode those later as part of the actual data. Here is the new line:

   set postarglist to tokenize (dePlus post_args) with delimiters {"&"}

Note how we are running DePlus before we Tokenize by combining the two statements on one line. We could also have written this as:

   set new_post_args to dePlus post_args
   set postarglist to tokenize new_post_args with delimiters {"&"}

That wastes a variable and some processing time, though. The way we do it, the output of DePlus is fed right into Tokenize for processing. Note: If you try to run Tokenize first, then DePlus, it will fail. The output of Tokenize is a list and DePlus requires a text string.
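To make the text-versus-list point concrete, here is a small sketch in plain AppleScript. It does not use DePlus or Tokenize at all; it imitates them with text item delimiters so you can watch the data change type. The sample post_args value and the variable names deplused and savedDelims are made up for the illustration.

```applescript
-- Made-up sample of what a client might send from a two-field form.
set post_args to "name=Jon+Smith&town=Hanover"

-- Stand-in for DePlus: takes text and swaps every "+" for a space.
set savedDelims to AppleScript's text item delimiters
set AppleScript's text item delimiters to "+"
set pieces to text items of post_args
set AppleScript's text item delimiters to " "
set deplused to pieces as string
-- deplused is still a single piece of text: "name=Jon Smith&town=Hanover"

-- Stand-in for Tokenize: splits the text at each "&" and hands back a list.
set AppleScript's text item delimiters to "&"
set postarglist to text items of deplused
set AppleScript's text item delimiters to savedDelims
-- postarglist is now a list: {"name=Jon Smith", "town=Hanover"}

-- The result of the split is a list, not text, which is why a
-- text-only command like DePlus cannot be run on it afterwards.
set isList to (class of postarglist is list)
```

Once the text has been split, any text-only step would have to be run on each item separately, which is why the cheap whole-block step goes first.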

Now we are ready to process all of those special encodings in the data. Here is the code that does this part:

   set postargtext to ""
   repeat with curritem in postarglist
      set postargtext to postargtext & ¬
         (Decode URL curritem) & return & return
   end repeat

Remember that postarglist is now a list of pairs of the type "name=data". We will take each item in that list in turn and add it to our page to be returned to the client. The repeat loop takes each item from the list in order and assigns the item to the variable curritem. Thus, each time through the loop, curritem contains the next list item.

Decode URL is run on each item before it is added to the page. It works by converting all hexadecimal encodings ("%XX") to their ASCII equivalents. After that, we add this decoded text onto the end of postargtext and add two carriage returns before looping for the next item.
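The order of the two steps matters: the "+" characters have to become spaces before the "%XX" codes are decoded, or a plus sign the user really typed gets wiped out. The sketch below traces one made-up value both ways. It uses plain AppleScript only; replaceText is a hypothetical helper written for this example, standing in for the OSAXes, and it handles just the two codes that appear in the sample.

```applescript
-- replaceText is a hypothetical helper, not part of any OSAX.
-- It swaps every occurrence of findText in sourceText for newText.
on replaceText(findText, newText, sourceText)
   set savedDelims to AppleScript's text item delimiters
   set AppleScript's text item delimiters to findText
   set pieces to text items of sourceText
   set AppleScript's text item delimiters to newText
   set resultText to pieces as string
   set AppleScript's text item delimiters to savedDelims
   return resultText
end replaceText

-- What a "+"-encoding client would send if the user typed: 1+1 = 2
-- ("%2B" is an encoded plus sign, "%3D" an encoded equals sign)
set sent to "1%2B1+%3D+2"

-- Right order: spaces first, then the "%XX" codes.
set spaced to replaceText("+", " ", sent)
-- spaced is "1%2B1 %3D 2"
set rightOrder to replaceText("%3D", "=", replaceText("%2B", "+", spaced))
-- rightOrder is "1+1 = 2"

-- Wrong order: codes first, then spaces. The real plus sign is lost.
set decodedFirst to replaceText("%3D", "=", replaceText("%2B", "+", sent))
set wrongOrder to replaceText("+", " ", decodedFirst)
-- wrongOrder is "1 1 = 2"
```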

Now for you AppleScript purists, yes, you could do this same processing in AppleScript without using the OSAX. However, unlike in the last lesson, this time we're talking some serious bulk in your script. Here is some sample code that would perform some of the same function as DecodeURL, except it only decodes occurrences of "%26" to ampersands:

on decodeAmp(inText)
   set outText to ""
   set ampPos to offset of "%26" in inText
   repeat while ampPos > 0
      if ampPos ≠ 1 then   -- if the ampersand is not the first character in the text
         set outText to outText & (text from character 1 to character (ampPos - 1) of inText)
      end if
      set outText to outText & "&"
      if (ampPos + 3) > (length of inText) then   -- nothing is left after the code
         set inText to ""
      else
         set inText to text from character (ampPos + 3) to character (length of inText) of inText
      end if
      set ampPos to offset of "%26" in inText   -- look for the next code in what remains
   end repeat
   return (outText & inText)
end decodeAmp

Even more lines would be required to make it handle all possible encodings. I have better things to do than figure out how to do things more slowly. Of course, if you don't believe me, feel free to write the code yourself. You should be able to time the difference on your wristwatch (or a good hourglass) if you're dealing with large arguments (like >10K of text). On the other hand, maybe you'll just want to take my word for it. That should leave you enough spare time to watch this week's Superman episode.
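If you do want to run that comparison, here is one possible sketch of a plain AppleScript decoder that handles every "%XX" code rather than just one. It is not how DecodeURL works internally; it is only a starting point for timing tests. The handler names hexValue and decodeText are made up for this example, and it assumes every "%" in the input is followed by two hexadecimal digits.

```applescript
-- hexValue and decodeText are hypothetical names, not OSAX commands.

-- Returns the value (0 to 15) of a single hexadecimal digit.
on hexValue(hexChar)
   set pos to offset of hexChar in "0123456789ABCDEF"
   -- try lowercase too, in case offset is treating case as significant
   if pos is 0 then set pos to offset of hexChar in "0123456789abcdef"
   if pos is 0 then error "Not a hexadecimal digit: " & hexChar
   return pos - 1
end hexValue

-- Converts every "%XX" code in inText to the character it stands for.
on decodeText(inText)
   -- split at each "%"; every piece after the first starts with two hex digits
   set savedDelims to AppleScript's text item delimiters
   set AppleScript's text item delimiters to "%"
   set pieces to text items of inText
   set AppleScript's text item delimiters to savedDelims

   set outText to item 1 of pieces
   repeat with i from 2 to (count of pieces)
      set piece to item i of pieces
      if (length of piece) < 2 then error "Incomplete %XX code"
      set charCode to 16 * (hexValue(character 1 of piece)) + (hexValue(character 2 of piece))
      set outText to outText & (ASCII character charCode)
      -- keep whatever followed the two hex digits
      if (length of piece) > 2 then set outText to outText & (text 3 thru -1 of piece)
   end repeat
   return outText
end decodeText

set sample to decodeText("A%26B%20%28C%29")
-- sample is "A&B (C)"
```

Note that this only covers the "%XX" half of the job. The "+" characters still have to be turned into spaces first, and the text is rebuilt one piece at a time, which is exactly the kind of work that gets slow on large arguments.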

This looks like a good place to bring up another performance issue. If you remember from the first lesson, a number of variables are passed to your CGI from WebSTAR. With the exception of post_args and http_search_args, all of the information in these variables is put there by WebSTAR. That means there is no reason to decode the information in these variables, since only data from the client is encoded. In general, the data in post_args is the only thing that will take a lot of processing in your scripts. If you are also using http_search_args to hold information, then you may need to decode it too.

Test the Script

I have compiled the script above on my own server, with an accompanying form. You will want to try this form out with a variety of WWW clients to show that it handles all current variations on encoding spaces.

Wrap It Up

Now you are ready to really get into the meat of the CGI applications. Everything up to now has been the foundation; the next lesson will begin to return useful information. I have only one reminder to bring up after this lesson:
Remember to thank the people who wrote this fine software you are using!
Products such as Tokenize (and the accompanying ACME Suite), DecodeURL, and others are making your life and mine easier and more productive and at a great price. Another good suggestion to remember is give back to the Net. If you are helped by a free product, consider offering something of your own freely to others.


Jon Wiederspan
Last Edited: April 26, 1995
Copyright Jon Wiederspan, 1994,1995